1. INTRODUCTION

A botanist can use several parts of a plant to recognize it, including the flowers, leaves, and roots.
Leaves are the most widely used because they are convenient to work with and give good recognition
results [1]. The purpose of identifying plants is to categorize them for recording purposes.
Identifying a plant from its leaves is an easy task for botanists, who can simply recognize it
using their senses [2]. For a machine to achieve the same recognition results, in contrast, it must
apply image-processing techniques to extract visual information and compare it with existing sets of
data [3]. Structured learning, better known as deep learning, has been recognized as a new area in computer
vision that has been reported to produce excellent results [4].

Deep learning, a class of machine-learning techniques used to extract characteristics of data, and
the convolutional neural network (CNN), a type of artificial neural network that shares weights across
spatial positions, have been found to be suitable for computer vision tasks [5]. Within the past few
years, deep learning algorithms, particularly CNNs, have demonstrated strong feature representation
power in computer vision [6].

CNNs have produced groundbreaking results over the last decade in various fields related to
pattern recognition, from image processing to voice recognition [7]. They have been applied to
voice analysis [8], image classification [9], scene classification [10], vehicle recognition [11],
fruit classification and ripeness grading [12], and food recognition [13].

The Bag of Features (BoF) model is also widely implemented in image processing and classification
tasks [14]. It is a machine-learning model adapted from the Bag of Words (BoW) model, which makes it
suitable for image classification [15], for example in breast histopathology image recognition and Arabic
handwritten word recognition [16]. In this paper, CNN and BoF are compared in order to analyze the
accuracy of the two models for leaf recognition.

Journal homepage: http://iaescore.com/journals/index.php/ijeecs
ISSN: 2502-4752

This paper evaluates a basic Convolutional Neural Network and Bag of Features for leaf
recognition. The comparison determines which model is more appropriate for recognizing leaves
from the Folio dataset. Several researchers have studied leaf recognition using various
techniques. A Support Vector Machine (SVM) with texture features achieved 99% accuracy [2].
With data augmentation, accuracies of 99.04% using AlexNet and 99.42% using GoogLeNet were
obtained [16]. In addition, shape features and a colour histogram with a k-nearest neighbour
classifier achieved 87.2% accuracy [3]. Since the results obtained with this dataset have been
very positive, it was chosen for the experiments in this research.


2. RESEARCH METHOD 
2.1. Bag of Features (BoF) 

Bag of Features (BoF) represents an image by instances of local features extracted from it.
The framework can be viewed as having two levels. The first level is associated with the pixel
intensities of the image and the type of local features extracted. The second level consists of two
parts, encoding and pooling [16]. The encoding part converts local features into a codebook, in which
the most representative visual patterns are coded as visual words (codewords) using codebook learning.
In the pooling part, a histogram or feature vector is then produced through a simple frequency analysis
of each codeword inside the image [15]. Figure 1 shows the general structure of the BoF framework.
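
The encoding and pooling steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: it assumes the codebook is already learned, uses Euclidean distance for the nearest-codeword assignment, and the function names and toy 2-D data are hypothetical (real SURF descriptors have 64 dimensions).

```python
import math

def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, word) for word in codebook]
    return distances.index(min(distances))

def bof_histogram(descriptors, codebook):
    """Encode each local descriptor as a codeword, then pool into a normalized histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy data (assumed for illustration): three codewords and five local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1), (0.0, 0.2)]
histogram = bof_histogram(descriptors, codebook)
print(histogram)
```

The resulting histogram has one bin per codeword and is the fixed-length feature vector that a classifier would receive, regardless of how many local features the image produced.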

In this project, Speeded-Up Robust Features (SURF) is used in BoF because it performs
well and requires only a low computational cost [18]. It is an image feature detector and
descriptor based on Hessian matrix measures. Its descriptor is built from 2D Haar wavelet
responses and has only 64 dimensions, which leads to quick feature extraction [18].
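
For reference, the Hessian-based measure and the descriptor length can be written as below. The notation follows the standard SURF formulation rather than anything defined in this paper: L_xx, L_xy and L_yy are assumed to be Gaussian second-derivative responses at point x and scale sigma, D_xx, D_xy and D_yy their box-filter approximations, and w a balancing weight of about 0.9. The descriptor length comes from a 4 x 4 grid of subregions, each contributing four sums of Haar wavelet responses d_x and d_y.

```latex
\mathcal{H}(\mathbf{x}, \sigma) =
\begin{bmatrix}
L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\
L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma)
\end{bmatrix},
\qquad
\det(\mathcal{H}_{\mathrm{approx}}) = D_{xx} D_{yy} - (w D_{xy})^{2}, \quad w \approx 0.9

\mathbf{v} = \Big( \sum d_x,\; \sum d_y,\; \sum |d_x|,\; \sum |d_y| \Big),
\qquad
4 \times 4 \ \text{subregions} \times 4 \ \text{values} = 64 \ \text{dimensions}
```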

For BoF training, 80 percent of the strongest features from each category are kept. The
resulting average accuracy is 0.85, which shows that this method achieves more than 80
percent accuracy. The clustering of the data completed at the 20th iteration, at about 4.39
seconds per iteration. Figure 2 shows the visual word graph for BoF based on the data used in this project.
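
The paper does not list its clustering code, so the following is only a sketch of how such a codebook could be learned. It assumes plain k-means (Lloyd's algorithm) as the clustering method and a hypothetical `keep_strongest` helper that ranks features by a detector strength score; the toy data, the value of k and the seed are illustrative.

```python
import math
import random

def keep_strongest(features, fraction=0.8):
    """Keep the given fraction of (strength, descriptor) pairs, strongest first."""
    ranked = sorted(features, key=lambda f: f[0], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [descriptor for _, descriptor in ranked[:keep]]

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns k centroids (the visual vocabulary)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group every descriptor with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(tuple(sum(d) / len(cluster) for d in zip(*cluster)))
            else:
                new_centroids.append(centroid)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged before the iteration limit
        centroids = new_centroids
    return centroids

# Toy (strength, descriptor) pairs: two tight groups plus two weak outliers.
features = [
    (0.9, (0.0, 0.0)), (0.8, (0.2, 0.0)), (0.7, (0.0, 0.2)), (0.3, (0.1, 0.1)),
    (0.6, (10.0, 10.0)), (0.5, (10.2, 10.0)), (0.4, (10.0, 10.2)), (0.2, (10.1, 10.1)),
    (0.1, (50.0, 50.0)), (0.05, (60.0, 60.0)),
]
strongest = keep_strongest(features, fraction=0.8)
vocabulary = sorted(kmeans(strongest, k=2))
print(vocabulary)
```

Discarding the weakest features before clustering keeps low-quality detections from pulling the codewords away from the dominant visual patterns.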

Figure 1. General structure of BoF framework [16]

Figure 2. Visual Word graph based on the BoF in this project




2.2. Convolutional Neural Network (CNN)


A Convolutional Neural Network (CNN) consists of four types of layers: convolution
layers, pooling layers, Rectified Linear Unit (ReLU) layers, and fully connected layers. Convolution layers
extract features from the input image by using the convolution operation and produce a feature map [1].
Multiple convolution layers can be applied to obtain different feature maps, which helps ensure that a
variety of features is extracted. Next, the pooling layer reduces the size of the feature maps, which makes
the representation more robust against noise and distortion [6]. Neural networks, and CNNs in particular,
rely on the third type of layer, the activation function. A CNN may use specific functions such as ReLU to
implement non-linear activation efficiently. The ReLU layer replaces all negative values in the feature map
with zero [3]. The fully connected layer, which is the last layer, computes a weighted sum of the features
from the previous layer to determine the output.
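
The four layer types can be illustrated with a small pure-Python sketch. It is not the network trained in this paper: it assumes a single-channel image, one hand-written filter, no padding, and hypothetical fully connected weights, and it only shows what each operation computes.

```python
def conv2d(image, kernel):
    """Slide one filter over a single-channel image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size, stride):
    """Keep the maximum of each size-by-size window, moving by stride."""
    rows = range(0, len(feature_map) - size + 1, stride)
    cols = range(0, len(feature_map[0]) - size + 1, stride)
    return [
        [max(feature_map[r + i][c + j] for i in range(size) for j in range(size)) for c in cols]
        for r in rows
    ]

def fully_connected(features, weights, biases):
    """One weighted sum of the flattened features per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, biases)]

# Toy 4x4 image with a left-to-right edge on top and the opposite edge below.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
edge_filter = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

feature_map = conv2d(image, edge_filter)
activated = relu(feature_map)
pooled = max_pool(activated, size=2, stride=1)
flattened = [v for row in pooled for v in row]
scores = fully_connected(flattened, weights=[[1, 1, 0, 0], [0, 0, 1, 1]], biases=[0, 0.5])
print(feature_map, pooled, scores)
```

In a trained network the filter values and the fully connected weights are learned from data rather than written by hand.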

Figure 3 shows the CNN architecture, which extracts features by applying convolution to the
input image, resizes the feature map in the pooling layer, and classifies the result in the fully
connected layer. The first convolution layer usually extracts low-level features such as edges, while
the second convolution layer extracts higher-level features such as shape.

Figure 3. Structure of Basic CNN [19]


3. RESULTS AND ANALYSIS 

The CNN and BoF experiments for this project were run on a Lenovo laptop with Windows 10,
an Intel Core i7 processor, and 8 GB of RAM, using Matlab 2018a.
The dataset used is the Folio Leaf Data Set [19]. The leaf pictures were taken from plants on the farm of
the University of Mauritius and nearby locations. There are 32 plant categories, and 20 leaf images
from each category are used in the experiments. All the images are resized to 224 by 224 pixels to keep
the data consistent for each method. Figure 4 shows sample images of the Folio Leaf dataset from all the 32
categories.

Figure 4. Sample images of Folio Leaf dataset


Experiments were conducted by changing the number of layers, the parameter values of the
convolution and pooling layers, and the learning rate. The purpose is to determine the combination of
values that produces the highest accuracy for leaf recognition on the Folio dataset. The results of the
experiments are recorded in Table 1. The first column gives the number of stacks of layers,
where a stack consists of one convolution layer, one max-pooling layer, and one ReLU layer. In the
Convolve Layer column, the first number in the square brackets is the size of the convolution filter and the
second is the number of filters. The third column gives the size of the max-pooling filter and the stride.
The fourth column gives the number of epochs and the learning rate. The number of epochs is the number of
complete passes over the training data, while the learning rate controls the amount of adjustment made to
the weights during training. According to Table 1, the best accuracy is achieved with three stacks of
layers, but the accuracy starts to decrease when the number of stacks is more than three.
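
To make the role of the filter size, pooling size and stride concrete, the sketch below traces the feature-map side length through the three-stack configuration [5,20], [3,20], [3,16] on a 224 by 224 input. It rests on assumptions the paper does not state: convolutions use stride 1 with no padding, and the single pooling value 3 is both the window size and the stride. The helper names are hypothetical.

```python
def output_size(input_size, window, stride=1):
    """Spatial size after a convolution or pooling window with no padding."""
    return (input_size - window) // stride + 1

def trace_stacks(input_size, conv_filters, pool_size, pool_stride):
    """Return the feature-map side length after each convolve + max-pooling stack."""
    sizes = []
    size = input_size
    for filter_size, _num_filters in conv_filters:
        size = output_size(size, filter_size)             # convolution, stride 1 assumed
        size = output_size(size, pool_size, pool_stride)  # max-pooling
        sizes.append(size)
    return sizes

# (filter size, number of filters) for each stack, as listed in Table 1.
best = [(5, 20), (3, 20), (3, 16)]
sizes = trace_stacks(224, best, pool_size=3, pool_stride=3)
flattened = sizes[-1] ** 2 * best[-1][1]  # values passed to the fully connected layer
print(sizes, flattened)
```

Under these assumptions each stack shrinks the feature map by roughly a factor of three, so only a few stacks fit before the map becomes too small for another 3 by 3 pooling window. With different padding or stride settings the sizes would differ.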


Table 1. Experimental results on parameter tuning for basic CNN

No. of Stacks   Convolve Layer             Pooling Layer   Epoch,           Accuracy (%)   Total Time
of Layers                                  and Stride      Learning Rate
1               [3,16]                     3               10, 0.001        71.92          6min 32s
1               [5,20]                     2               10, 0.0001       65.62          8min 23s
2               [3,16], [3,16]             3               10, 0.001        79.81          4min 27s
2               [3,80], [3,64]             2               10, 0.001        74.82          15min 40s
3               [3,16], [3,16], [3,32]     3               10, 0.001        76.66          4min 44s
3               [5,20], [3,20], [3,16]     3               10, 0.001        82.03          3min 35s


Figure 5 shows the training progress for convolve layers [5,20], [3,20], [3,16], pooling size and
stride 3, 10 epochs, and a learning rate of 0.001. With these parameter values the accuracy reached 82.03%
with an elapsed time of 3 minutes and 35 seconds, which is faster than the other configurations. This
configuration is also the most accurate of those tested.

Figure 5. Training Progress


Table 2 shows an overview of the accuracy of basic CNN compared to BoF on the
Folio Leaf dataset. Table 2 shows that BoF is more accurate than basic CNN but takes longer
to achieve this result. This is because extracting the SURF features takes longer than extracting
the low-level and middle-level features in the basic CNN.


Table 2. Performance overview for basic CNN and BoF on the Folio Leaf dataset

Model                 Basic CNN   BoF
Validation Accuracy   0.82        0.85
Elapsed Time (s)      177         276
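
The trade-off in Table 2 can be expressed as a short calculation. The numbers are taken directly from the table; the variable names are illustrative only.

```python
# Values copied from Table 2.
results = {
    "Basic CNN": {"accuracy": 0.82, "elapsed_s": 177},
    "BoF": {"accuracy": 0.85, "elapsed_s": 276},
}

cnn, bof = results["Basic CNN"], results["BoF"]
accuracy_gain_points = round((bof["accuracy"] - cnn["accuracy"]) * 100, 2)
time_ratio = round(bof["elapsed_s"] / cnn["elapsed_s"], 2)
print(f"BoF gains {accuracy_gain_points} percentage points and takes {time_ratio}x the time")
```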


4. CONCLUSION

This paper has presented an evaluation of basic CNN and BoF for leaf recognition, comparing
the accuracy of the two models on the Folio leaf dataset. The experimental results show that
basic CNN achieves a lower accuracy than BoF, because a CNN requires a much larger amount of
training data than BoF. Therefore, when the amount of data is limited, BoF still provides a good
result and is preferable to CNN. In future research, we plan to enhance the CNN architecture and
increase the amount of data to obtain more accurate and faster results.